
prov/shm: two performance fixes for the inject path #1

Closed

yinliaws wants to merge 2 commits into zachdworkin:lock from yinliaws:shm-inject-perf-fixes

Conversation


@yinliaws yinliaws commented May 5, 2026

Stacked on top of: ofiwg#12109

Summary

PR ofiwg#12109 moved the SHM inject buffer freestack from the sender's
region into the peer's region, protected by a new peer_smr->fs_lock.
This was needed for correctness but caused measurable throughput
regressions of -17% to -49% versus libfabric main at 512 B–4 KB
messages on fi_rdm_tagged_bw.

This series adds two focused fixes that recover the lost throughput:

  1. prov/shm: replace inject index divq with a shift - replaces a 64-bit
    division in smr_format_inject() with a one-cycle shift. Independent of
    this series; applies to main-branch libfabric too. Largest impact on Arm.

  2. prov/shm: add per-peer batched inject buffer cache - amortizes
    peer_smr->fs_lock over batches of 32 buffers. Fast path is lock-free
    (cache array + counter), slow path takes the lock once and pops 32 buffers.

yinliaws added 2 commits May 4, 2026 21:16

prov/shm: replace inject index divq with a shift

smr_format_inject() calls smr_freestack_get_index() on every inject
send to populate cmd->data.inject_buf_index. That helper computes

    index = (pointer - base - entry_base_offset) / fs->object_size

where fs->object_size is a runtime struct field, so the compiler emits
a hardware divide instruction (divq on x86, ~20-30 cycles; Arm's udiv,
~50+ cycles).

However, for the inject pool specifically, object_size is
sizeof(struct smr_inject_buf) == SMR_INJECT_SIZE == 4096, which is a
compile-time constant and a power of two. Inline the computation here
and replace the division with a 1-cycle shift.
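
A minimal sketch of the before/after computation (the signatures and
helper names are illustrative, following this commit message rather
than the actual libfabric source):

    #include <stdint.h>

    #define SMR_INJECT_SIZE 4096  /* sizeof(struct smr_inject_buf), power of two */

    /* Before: object_size is a runtime field, so the compiler must
     * emit a real divide (divq on x86-64). */
    static inline int64_t
    index_by_div(uintptr_t buf, uintptr_t base, size_t entry_base_offset,
                 size_t object_size)
    {
        return (int64_t) ((buf - base - entry_base_offset) / object_size);
    }

    /* After: for the inject pool, object_size is the compile-time
     * constant SMR_INJECT_SIZE (4096 == 1 << 12), so the divide
     * strength-reduces to a single shift. */
    static inline int64_t
    index_by_shift(uintptr_t buf, uintptr_t base, size_t entry_base_offset)
    {
        return (int64_t) ((buf - base - entry_base_offset) >> 12);
    }

The compiler can only perform this strength reduction itself when the
divisor is a visible compile-time constant, which is exactly what
inlining the computation at the call site provides.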

perf annotate showed smr_freestack_get_index's divq accounting for
~45% of smr_do_inject samples at 4KB BW on AMD Zen 3. The shift
replacement preserves correctness; no other behavior changes.

Measured impact (fi_rdm_tagged_bw, 5-round median vs libfabric main):

    Platform  Size  Before  After  Delta
    Graviton   2K    -27%   +15%   +42 pp
    Graviton   4K    -21%    +7%   +28 pp
    P5EN       2K     +7%   +20%   +13 pp
    P5EN       4K     +3%   +10%    +7 pp

Arm benefits the most because the ARMv8 integer divider is slower than
x86's.

prov/shm: add per-peer batched inject buffer cache

The peer-owned inject pool design requires every inject send to acquire
peer_smr->fs_lock to pop a buffer from the peer's freestack. Under
high send rates (fi_rdm_tagged_bw with window=2000), this lock becomes
the dominant cost for small-to-medium messages (512B-4KB), producing
regressions of -17% to -49% versus libfabric main.

Amortize the lock by caching SMR_INJECT_CACHE_BATCH (32) buffers
per-peer in the sending endpoint. The fast path (cache hit) is a
single array load plus counter decrement with no atomic operations.
The slow path (cache empty) acquires the peer lock once and batch-pops
up to SMR_INJECT_CACHE_BATCH buffers in a single critical section.

Data structures:
  - struct smr_inject_cache: fixed-size array of buffer pointers + count
  - struct smr_ep::inject_cache[SMR_MAX_PEERS]: one cache per peer
    (placed at end of smr_ep to preserve cacheline layout of hot fields)
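
A minimal sketch of the fast and slow paths, assuming the names above
plus illustrative helpers (peer_inject_pool(), a pthread spinlock for
fs_lock, and the freestack pop/isempty calls; the real libfabric API
and surrounding smr type definitions may differ):

    #define SMR_INJECT_CACHE_BATCH 32

    struct smr_inject_cache {
        struct smr_inject_buf *buf[SMR_INJECT_CACHE_BATCH];
        uint32_t count;
    };

    static struct smr_inject_buf *
    smr_get_inject_buf(struct smr_ep *ep, int64_t id,
                       struct smr_region *peer_smr)
    {
        struct smr_inject_cache *cache = &ep->inject_cache[id];

        if (cache->count)                  /* fast path: no lock, no atomics */
            return cache->buf[--cache->count];

        /* Slow path: one lock acquisition refills up to a full batch. */
        pthread_spin_lock(&peer_smr->fs_lock);
        while (cache->count < SMR_INJECT_CACHE_BATCH &&
               !smr_freestack_isempty(peer_inject_pool(peer_smr)))
            cache->buf[cache->count++] =
                smr_freestack_pop(peer_inject_pool(peer_smr));
        pthread_spin_unlock(&peer_smr->fs_lock);

        return cache->count ? cache->buf[--cache->count] : NULL;
    }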

On endpoint close, cached buffers are pushed back to their peers'
freestacks under the appropriate lock so nothing leaks.
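
The matching close-time flush, under the same assumed names, returns
every cached buffer with one lock acquisition per peer:

    static void
    smr_flush_inject_cache(struct smr_ep *ep, int64_t id,
                           struct smr_region *peer_smr)
    {
        struct smr_inject_cache *cache = &ep->inject_cache[id];

        pthread_spin_lock(&peer_smr->fs_lock);
        while (cache->count)
            smr_freestack_push(peer_inject_pool(peer_smr),
                               cache->buf[--cache->count]);
        pthread_spin_unlock(&peer_smr->fs_lock);
    }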

Memory cost: SMR_INJECT_CACHE_BATCH (32) * SMR_MAX_PEERS (256) *
8-byte pointers = 64 KB per endpoint, plus 256 * 4 bytes for the
count fields.

Measured impact (fi_rdm_tagged_bw, 5-round median vs libfabric main,
PR ofiwg#12109 baseline shown for comparison):

    Platform   Size   PR ofiwg#12109   This patch
    Graviton   512B    -28%       +23%
    Graviton    1K     -32%       +20%
    Graviton    2K     -27%       +15%
    P4D        256B    -23%       +60%
    P4D         1K     -40%        +9%
    P5EN       256B    -34%       +52%
    P5EN        2K     -39%        +8%

Combined with the previous shift commit, the 256B-2K regressions are
fully recovered on all Graviton, AMD, and Intel platforms tested.
